Computational techniques for spatial logistic regression with large data sets
Abstract
In epidemiological research, outcomes are frequently non-normal, sample sizes may be large, and effect sizes are often small. To relate health outcomes to geographic risk factors, fast and powerful methods for fitting spatial models, particularly for non-normal data, are required. I focus on binary outcomes, with the risk surface a smooth function of space, but the development herein is relevant for non-normal data in general. I compare penalized likelihood models, including the penalized quasi-likelihood (PQL) approach, and Bayesian models based on fit, speed, and ease of implementation. A Bayesian model using a spectral basis representation of the spatial surface via the Fourier basis provides the best tradeoff of sensitivity and specificity in simulations, detecting real spatial features while limiting overfitting and being reasonably computationally efficient. One of the contributions of this work is further development of this underused representation. The spectral basis model outperforms the penalized likelihood methods, which are prone to overfitting, but is slower to fit and not as easily implemented. A Bayesian Markov random field model performs less well statistically than the spectral basis model, but is very computationally efficient. We illustrate the methods on a real dataset of cancer cases in Taiwan. The success of the spectral basis with binary data and similar results with count data suggest that it may be generally useful in spatial models and more complicated hierarchical models.
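As a rough illustration of the setting described in the abstract (binary outcomes whose risk surface is a smooth function of space, represented in a Fourier basis), the following is a minimal sketch and not the paper's method. It assumes simulated data on the unit square and swaps the Bayesian spectral model for a much simpler ridge-penalized logistic regression fitted by Newton's method. The function names `fourier_basis` and `fit_penalized_logistic`, the penalty value, and the simulated surface are all hypothetical choices made for this example.

```python
import numpy as np


def fourier_basis(coords, max_freq):
    """Real Fourier basis (cos/sin pairs) on the unit square, without the constant term."""
    cols = []
    for k1 in range(-max_freq, max_freq + 1):
        for k2 in range(0, max_freq + 1):
            # Keep one frequency from each conjugate pair and skip (0, 0).
            if k2 == 0 and k1 <= 0:
                continue
            phase = 2.0 * np.pi * (k1 * coords[:, 0] + k2 * coords[:, 1])
            cols.append(np.cos(phase))
            cols.append(np.sin(phase))
    return np.column_stack(cols)


def fit_penalized_logistic(X, y, penalty, n_iter=50, tol=1e-8):
    """Ridge-penalized logistic regression via Newton's method (intercept unpenalized)."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    P = penalty * np.eye(p + 1)
    P[0, 0] = 0.0  # do not shrink the intercept
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(design @ beta)))
        w = mu * (1.0 - mu)
        grad = design.T @ (y - mu) - P @ beta
        hess = design.T @ (design * w[:, None]) + P
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated data (an assumption for illustration, not the Taiwan dataset).
rng = np.random.default_rng(0)
coords = rng.uniform(size=(4000, 2))
true_logit = -0.5 + 1.0 * np.cos(2.0 * np.pi * coords[:, 0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

B = fourier_basis(coords, max_freq=2)
beta = fit_penalized_logistic(B, y, penalty=1.0)
fitted_logit = beta[0] + B @ beta[1:]
agreement = np.corrcoef(fitted_logit, true_logit)[0, 1]
```

In this sketch a single ridge penalty shrinks all basis coefficients equally. A spectral-basis model of the kind the abstract describes would instead let the amount of shrinkage vary with frequency, which is what limits overfitting at high frequencies; that refinement is omitted here.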
Similar resources
Spatial Design for Knot Selection in Knot-Based Low-Rank Models
Analysis of large geostatistical data sets usually entails expensive matrix computations. This problem creates challenges in implementing statistical inference for traditional Bayesian models. In addition, researchers often face multiple spatial data sets with complex spatial dependence structures that are difficult to analyze. This is a problem for MCMC sampling algorith...
Weighted logistic regression for large-scale imbalanced and rare events data
Latest developments in computing and technology, along with the availability of large amounts of raw data, have led to the development of many computational techniques and algorithms. Concerning binary data classification in particular, analysis of data containing rare events or disproportionate class distributions poses a great challenge to industry and to the machine learning community. Logis...
Sample size determination for logistic regression
The problem of sample size estimation is important in medical applications, especially in cases of expensive measurements of immune biomarkers. This paper describes the problem of logistic regression analysis with sample size determination algorithms, namely the methods of univariate statistics, logistic regression, cross-validation and Bayesian inference. The authors, treating the regr...
Digital mapping of soil classes using a legacy soil map in an arid region of southeastern Iran
Mapping the spatial distribution of soil taxonomic classes is important for the effective use of soil and for management decisions. Digital soil mapping (DSM) may have advantages over conventional soil mapping approaches as it may better capture observed spatial variability and reduce the need to aggregate soil types. A key component of any DSM activity is the method used to define the relat...
Agris on-line Papers in Economics and Informatics
Developments in information technology have enabled the accumulation of large databases, and most environmental, agricultural and medical databases consist of large quantities of real-time observatory datasets in high-dimensional space. The curse of these high-dimensional datasets is their spatial and computational requirements, which lead to an ever-growing need for attribute reduction techniques...
Journal title:
- Computational statistics & data analysis
Volume 51, Issue 8
Pages: -
Publication date: 2007